Online ASR with Emformer RNN-T
Author: Jeff Hwang, Moto Hira
This tutorial shows how to use Emformer RNN-T and streaming API to perform online speech recognition.
1. Overview
Performing online speech recognition is composed of the following steps:

1. Build the inference pipeline. Emformer RNN-T is composed of three components: feature extractor, decoder, and token processor.
2. Format the waveform into chunks of the expected size.
3. Pass the data through the pipeline.
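The three steps above can be sketched as a minimal loop. The components below are toy stubs standing in for the real feature extractor, decoder, and token processor — the names and signatures here are illustrative, not the torchaudio API:

```python
# Toy sketch of the online-ASR loop; all three components are stubs,
# not the actual torchaudio pipeline objects.

def feature_extractor(segment):
    # Stub: a real extractor would compute e.g. log-mel features.
    return [x * 2 for x in segment]

def decoder(features, state):
    # Stub: a real decoder would emit token IDs and carry RNN-T state.
    return [len(features)], state

def token_processor(tokens):
    # Stub: a real processor would map token IDs back to text.
    return " ".join(str(t) for t in tokens)

def chunked(waveform, chunk_size):
    # Step 2: format the waveform into chunks of the expected size.
    for i in range(0, len(waveform), chunk_size):
        yield waveform[i : i + chunk_size]

waveform = list(range(10))
state, transcript = None, []
for chunk in chunked(waveform, chunk_size=4):
    feats = feature_extractor(chunk)        # step 1 component
    tokens, state = decoder(feats, state)   # step 3: pass data through
    transcript.append(token_processor(tokens))
print(" ".join(transcript))  # → 4 4 2
```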
2. Preparation
Note
The streaming API requires FFmpeg libraries (>=4.1).
If you are using the Anaconda Python distribution,
conda install -c anaconda ffmpeg will install
the required libraries.
When running this tutorial in Google Colab, the following commands should suffice.
!add-apt-repository -y ppa:savoury1/ffmpeg4
!apt-get -qq install -y ffmpeg
import IPython
import torch
import torchaudio
print(torch.__version__)
print(torchaudio.__version__)
from torchaudio.prototype.io import Streamer
Out:
1.12.0.dev20220207+cpu
0.11.0.dev20220207+cpu
3. Construct the pipeline
Pre-trained model weights and related pipeline components are
bundled as torchaudio.pipelines.RNNTBundle.
We use torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH,
which is an Emformer RNN-T model trained on the LibriSpeech dataset.
bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
feature_extractor = bundle.get_streaming_feature_extractor()
decoder = bundle.get_decoder()
token_processor = bundle.get_token_processor()
Out:
Downloading: "https://download.pytorch.org/torchaudio/models/emformer_rnnt_base_librispeech.pt" to /root/.cache/torch/hub/checkpoints/emformer_rnnt_base_librispeech.pt
100%|##########| 293M/293M [00:03<00:00, 89.6MB/s]
Streaming inference works on input data with overlap. Emformer RNN-T expects each input segment to carry additional right context appended after the main segment.
The sizes of the main segment and right context, along with the expected sample rate, can be retrieved from the bundle.
sample_rate = bundle.sample_rate
frames_per_chunk = bundle.segment_length * bundle.hop_length
right_context_size = bundle.right_context_length * bundle.hop_length
print(f"Sample rate: {sample_rate}")
print(f"Main segment: {frames_per_chunk} frames ({frames_per_chunk / sample_rate} seconds)")
print(f"Right context: {right_context_size} frames ({right_context_size / sample_rate} seconds)")
Out:
Sample rate: 16000
Main segment: 2560 frames (0.16 seconds)
Right context: 640 frames (0.04 seconds)
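The numbers above follow directly from the bundle attributes. As a sanity check of the arithmetic, the following uses values consistent with the printed output (a hop length of 160 samples, a segment length of 16 feature frames, and a right context of 4 feature frames — assumed here rather than queried from the bundle):

```python
# Values assumed to match the bundle, inferred from the printed output above.
hop_length = 160          # samples per feature frame
segment_length = 16       # feature frames per main segment
right_context_length = 4  # feature frames of right context
sample_rate = 16000

frames_per_chunk = segment_length * hop_length
right_context_size = right_context_length * hop_length

print(frames_per_chunk, frames_per_chunk / sample_rate)      # → 2560 0.16
print(right_context_size, right_context_size / sample_rate)  # → 640 0.04
```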
4. Configure the audio stream
Next, we configure the input audio stream using Streamer().
For details on this API, please refer to the Media Stream API tutorial.
The following audio file was originally published by the LibriVox project and is in the public domain.
https://librivox.org/great-pirate-stories-by-joseph-lewis-french/
It was re-uploaded for the sake of this tutorial.
src = "https://download.pytorch.org/torchaudio/tutorial-assets/greatpiratestories_00_various.mp3"
streamer = Streamer(src)
streamer.add_basic_audio_stream(frames_per_chunk=frames_per_chunk, sample_rate=bundle.sample_rate)
print(streamer.get_src_stream_info(0))
print(streamer.get_out_stream_info(0))
Out:
SourceAudioStream(media_type='audio', codec='mp3', codec_long_name='MP3 (MPEG audio layer 3)', format='fltp', bit_rate=128000, sample_rate=44100.0, num_channels=2)
OutputStream(source_index=0, filter_description='aresample=16000,aformat=sample_fmts=fltp')
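As the filter description above shows, the output stream is resampled from the 44.1 kHz stereo source to 16 kHz. A quick check of what that means for chunk sizes (the frame counts here are taken from the values printed earlier in this tutorial):

```python
# Each 2560-frame output chunk at 16 kHz is produced from proportionally
# more frames of the 44.1 kHz source.
src_rate = 44100
out_rate = 16000
frames_per_chunk = 2560  # from the bundle, printed earlier

src_frames = frames_per_chunk * src_rate / out_rate
print(src_frames)  # → 7056.0
```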
Streamer iterates over the source media without overlap, so we create a helper structure that caches a chunk and returns it with the right context appended once the next chunk arrives.
class ContextCacher:
    """Cache the previous chunk and combine it with the new chunk

    Args:
        chunk (torch.Tensor): Initial chunk
        right_context_size (int): The size of the right context.
    """

    def __init__(self, chunk: torch.Tensor, right_context_size: int):
        self.chunk = chunk
        self.right_context_size = right_context_size

    def __call__(self, chunk: torch.Tensor):
        right_context = chunk[: self.right_context_size, :]
        chunk_with_context = torch.cat((self.chunk, right_context))
        self.chunk = chunk
        return chunk_with_context
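To see what the cacher produces, here is a small sketch using plain Python lists in place of tensors (list concatenation standing in for torch.cat; the slicing logic mirrors ContextCacher, but this class and its name are illustrative only):

```python
# Plain-list analogue of ContextCacher: each call returns the previous
# chunk with the first `right_context_size` samples of the new chunk appended.
class ListContextCacher:
    def __init__(self, chunk, right_context_size):
        self.chunk = chunk
        self.right_context_size = right_context_size

    def __call__(self, chunk):
        right_context = chunk[: self.right_context_size]
        chunk_with_context = self.chunk + right_context
        self.chunk = chunk
        return chunk_with_context

demo_cacher = ListContextCacher([0, 1, 2, 3], right_context_size=2)
print(demo_cacher([4, 5, 6, 7]))    # → [0, 1, 2, 3, 4, 5]
print(demo_cacher([8, 9, 10, 11]))  # → [4, 5, 6, 7, 8, 9]
```

Note how consecutive outputs overlap by the right-context size, which is exactly the overlap the streaming model expects.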
5. Run stream inference
Finally, we run the recognition.
First, we initialize the stream iterator, the context cacher, and the state and hypothesis that the decoder uses to carry over the decoding state between inference calls.
stream_iterator = streamer.stream()
cacher = ContextCacher(next(stream_iterator)[0], right_context_size)
state, hypothesis = None, None
Next, we run the inference.
For better display, we create a helper function that processes the source stream for a given number of iterations, and call it repeatedly.
@torch.inference_mode()
def run_inference(num_iter=200):
    global state, hypothesis
    chunks = []
    for i, (chunk,) in enumerate(stream_iterator, start=1):
        # Append the right context and drop the channel dimension
        segment = cacher(chunk).T[0]
        features, length = feature_extractor(segment)
        # Beam search with beam width 10, carrying over decoder state
        hypos, state = decoder.infer(features, length, 10, state=state, hypothesis=hypothesis)
        hypothesis = hypos[0]
        transcript = token_processor(hypothesis.tokens, lstrip=False)
        print(transcript, end="", flush=True)

        chunks.append(chunk)
        if i == num_iter:
            break

    return IPython.display.Audio(torch.cat(chunks).T.numpy(), rate=bundle.sample_rate)
run_inference()
Out:
forward great pirate's this is aver vice recordings are in the public domain for more information or please visitor recording by james christoper great pirite stories by various eded by joseph's fordie emboys the romance of the sea in its highest expression it is a sad but inevable commentary on our civilization that so far as the sea is concerned it
run_inference()
Out:
is developed from its infancy down to a century or so ago under one phase or another of pircy if men were savages on land they were doubly so at sea and all the years oftime adventure years that added to theap world there was little left to discover could not wholly eradic theat germ it went out gradually with the settlement and ordering of the far british colonies great britain foremost of sea powers must bered doing more both directly and indirectorally for the abolition of crime and disord on the high seas than any other
run_inference()
Out:
force but the conquest was not complete till the steam which chased the rover into the furthesters of his domain it is said that he survives even to day in certain spots in the chines but he is certainly a pir of any sort would be as great a curiosity to day if he could be caught and exhibited as a fab the fact remains and will always persist that lore of the sea is far away the most picturesque figure in the more genuine gross his character the higher degree of interest as he inspire there
run_inference()
Out:
may be a certain perversity in this for the pirate was unquestionably a bad man at his best or worst considering his surroundings and conditions undoubtily the worst man that ever lived there is little to soften the dark yet glowing picture of his exploits but again it must be remembered that only does the note of distant subdue and even lend a certain enchant to the scene but the effective contrast between our peaceful times andributes much to deepen our interest in him perhaps it is this latter added to that death wasp on the human breast that
run_inference()
Out:
gloves at the tale which makes them the kind of hero of romance that is to day he is undially a redoubtable historical figure it is a curious fact that commer seas cradled in the lap of bucci the constant danger of thes in this form only made heartier mariners out of the merchant adventurers actually stimating and strengthening marits bucc is only a polite for piry thus became the high romance of theas during the great centuries oftime adventure it went in hand with discovery
run_inference()
Out:
they were in fact almost inseparable most of the mighty mariners from the days of the discoverer through those of the redoubtable sir francis drake down to her own jones answered to the roll it was a bold hearty world this avarice up to the advent of our giant's steam every foot of which was won my fierce conquest of one sort or another out of this passed the pir emerges are romantic even at times heroic figure this final nic despite his crimes cannot altogether be denied a hero he is and will remain
run_inference()
Out:
so long as tales of theer told so have at him in these pages joseth lewherents and of recording james christopher jist christopher at yucha come
Total running time of the script: ( 1 minutes 0.864 seconds)